Abstract: We consider the Content Centric Network (CCN) interest forwarding problem as a Multi-Armed Bandit (MAB) problem with delays. We investigate the transient behaviour of the ε-greedy, tuned ε-greedy and Upper Confidence Bound (UCB) interest forwarding policies. Surprisingly, all three policies need only a very short initial exploratory phase. We demonstrate that the tuned ε-greedy algorithm is nearly as good as the UCB algorithm, the best currently available algorithm. We prove the uniform logarithmic bound for the tuned ε-greedy algorithm. In addition to their immediate application to CCN interest forwarding, the new theoretical results for the MAB problem with delays represent significant theoretical advances in the machine learning discipline.
Interest Forwarding in CCN as a Multi-Armed Bandit Problem with Delays




Keywords: Information Centric Networks, Content Centric Networks, Interest Forwarding, Multi-Armed Bandit Problem with Delays



CCN Interest Forwarding Strategy 






1 Introduction 

There is a conceptual clash between the rapidly expanding dissemination of digital information and the host-based network architecture of the current Internet. To facilitate the dissemination of digital information, several Information-Centric Network (ICN) architectures have been proposed: TRIAD [?], DONA [10], CCN/NDN [?]. Since the CCN/NDN (Content-Centric Networking / Named Data Networking) proposal appears to be the most elaborate, we develop our contribution in the framework and within the terminology of CCN/NDN. For the sake of brevity, we shall refer to CCN/NDN as CCN. The main features of the ICN paradigm, and of the CCN architecture in particular, are that content is addressed by a unique name and can have many identical cached copies. Any such copy can be retrieved independently of its location. The content is typically divided into several small chunks. A chunk is also uniquely identified. A chunk of content is located and requested by forwarding so-called interests. A user or a CCN router can forward interests to one or more neighbour CCN routers. Clearly, if there is no bandwidth limitation, the most efficient way is to forward interests to all available neighbour routers. However, if there is a bandwidth limitation, or the interest sender has to pay for the interest and/or the delivered content, there can be better interest forwarding strategies than simple flooding.

In the present work we propose to view the problem of finding an optimal interest forwarding strategy as a Multi-Armed Bandit (MAB) problem. The MAB problem is a classical problem in machine learning in which a decision maker seeks an optimal balance between exploration and exploitation. Here we adopt three well-known algorithms from the MAB literature: ε-greedy [12], tuned ε-greedy and UCB [1]. Our study brings advances to both the networking and the machine learning disciplines. We show that the MAB algorithms make it possible to detect the optimal router while sending only a very small number of interests to sub-optimal routers. The novelty from the machine learning perspective is that we analyze the transient period of the MAB algorithms with delays. This is a very challenging topic with hardly any results available in the literature. In fact, we can only cite the work [3] on MAB with delay. However, the model in [3] is different from ours and relies on many restrictive assumptions.

We expect that our MAB-based mechanisms can be integrated in the Interest Control Protocol 
(ICP) which regulates the pacing of interests [3]. 

The paper is organized as follows. In Section 2 we present a formal model of the problem and describe three algorithms that we propose for CCN interest forwarding. We analyze the initial exploratory phase of these algorithms in Section 3 both numerically and mathematically, providing a bound and an approximation of its duration. In Section 4 we study the exploitation phase of the tuned ε-greedy algorithm and prove a logarithmic bound on the probability of choosing a suboptimal router. Section 5 concludes.

2 Model and interest forwarding strategies 

We suppose that a CCN router or a user can forward interests to $K$ neighbour CCN routers. We consider a discrete time model. The slot duration can be chosen equal to the minimal duration of packet generation at the MAC layer. Therefore, we assume that at each time slot $t \in T := \{0, 1, 2, \dots\}$ the user can send only one interest to one of the $K$ neighbour CCN routers.

CCN routers reply with delays distributed according to discrete distribution functions $F_k(x)$, $k = 1, \dots, K$, $x = 1, 2, \dots$, with means denoted by $\mu_k$. Specifically, we assume that a chunk corresponding to the interest generated at the present slot and forwarded to the neighbour router $k$ is delivered by router $k$ after a random number of slots distributed according to the distribution function $F_k(x)$. Thus, we shall know the effect of the action taken at time slot $t$ only at the future time slot $t + X_k(t)$, where the $X_k(t)$ are i.i.d. random variables generated according to $F_k(x)$.

We are interested in minimizing the expected number of interests sent to sub-optimal routers, or to sub-optimal arms in the terminology of the multi-armed bandit framework [12]. The challenging novelty of our setting with respect to the classical multi-armed bandit problem formulation is that the cost becomes known to the decision maker with delays. In fact, the costs are the delays.

The optimal policy in the classical setting without delay is obtained by the Gittins index rule
[5], which breaks the combinatorial complexity of the problem by computing the Gittins index
(a history-dependent function) for each router in isolation and then simply sending the interest 
at every slot to the router whose current Gittins index value is lowest. This result significantly 
reduces the dimensionality of the problem, but the evaluation of the Gittins index may still be 
computationally tedious, especially if the index depends on the whole history, not only on the 
last observed state. Moreover, the Gittins optimality result requires that the evolution of costs 
from routers be mutually independent, while the algorithms described below are efficient even
for dependent arms [1].

Since a strictly optimal policy is likely to be very complex even in the classical setting without delay, many researchers have proposed sensible policies and shown desirable properties of such policies [?]. One desirable property of a multi-armed bandit policy is the uniform logarithmic bound on the number of times a sub-optimal arm is chosen by the decision maker. We shall establish the uniform logarithmic bound for the tuned ε-greedy policy in the case of delayed information in Section 4.

In the present work we consider the following three algorithms: the ε-greedy algorithm, the tuned ε-greedy algorithm, and the UCB (Upper Confidence Bound) algorithm. These are the most used multi-armed bandit algorithms, and in this paper we propose their generalizations to the setting with delayed information.

Let us formally describe each algorithm. The ε-greedy algorithm is the simplest algorithm. Its main drawback is that the expected number of times a sub-optimal arm is chosen grows linearly in time. A variant of the ε-greedy algorithm was proposed in [12] for Markov Decision Process models without delay.

Denote by $T_k(t)$ the total number of interests sent to router $k$ and answered up to the end of slot $t - 1$, and

$$A_k(\tau, t) := \mathbf{1}\{\text{interest sent to } k \text{ at } \tau \text{ and answered up to the end of slot } t - 1\}.$$

Algorithm ε-greedy

1. Initialization: Choose $t_0 \in T$ and $\varepsilon \in (0, 1)$. During the first $t_0$ slots keep sending interests to routers in round robin fashion or randomly to routers chosen according to the uniform distribution.

2. for each time slot $t \ge t_0$ do

3. For each router $k$, compute the average delay:

$$\bar X_{k,T_k(t)} := \frac{1}{T_k(t)} \sum_{\tau=0}^{t-1} A_k(\tau, t) X_k(\tau).$$

4. For each router $k$, set the index equal to the average delay $\bar X_{k,T_k(t)}$.

5. With probability $1 - \varepsilon$ send a new interest to the router with the smallest index, or with probability $\varepsilon$ send a new interest to a uniformly randomly chosen router.

6. end for
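The steps above can be sketched as a small simulation. This is only an illustration under stated assumptions, not the authors' implementation: the function name `simulate_eps_greedy`, the `delay_samplers` interface, the round-robin initial phase without random starting arm, the convention that a reply arriving in slot $\tau + X_k(\tau)$ is usable from that slot on, and the rule that a router with no reply yet is tried first are all choices made here for concreteness.

```python
import random

def simulate_eps_greedy(delay_samplers, horizon, t0, eps, seed=0):
    """Epsilon-greedy forwarding with delayed feedback; returns interests sent per router."""
    rng = random.Random(seed)
    K = len(delay_samplers)
    pending = []                # (arrival_slot, router, delay) of unanswered interests
    answered_sum = [0.0] * K    # sum of observed delays per router
    answered_cnt = [0] * K      # number of answered interests per router, T_k(t)
    sent = [0] * K

    for t in range(horizon):
        # Register replies that have arrived (assumed usable once arrival_slot <= t).
        still_pending = []
        for arrival, k, x in pending:
            if arrival <= t:
                answered_sum[k] += x
                answered_cnt[k] += 1
            else:
                still_pending.append((arrival, k, x))
        pending = still_pending

        if t < t0:
            k = t % K                       # initial phase: round robin
        elif rng.random() < eps:
            k = rng.randrange(K)            # explore: uniformly random router
        else:
            def index(i):
                # Average observed delay; -inf (assumption) if nothing was answered yet.
                return answered_sum[i] / answered_cnt[i] if answered_cnt[i] else float("-inf")
            k = min(range(K), key=index)    # exploit: smallest index

        x = delay_samplers[k](rng)          # delay is drawn now but observed only later
        pending.append((t + x, k, x))
        sent[k] += 1
    return sent

# Hypothetical delays: a fixed propagation delay plus a small random part.
samplers = [lambda r: 2 + r.randint(0, 2), lambda r: 2 + r.randint(1, 6), lambda r: 2 + r.randint(3, 9)]
print(simulate_eps_greedy(samplers, horizon=5000, t0=3, eps=0.1))
```

With a constant $\varepsilon$, a fixed fraction of interests keeps going to randomly chosen routers, which is the linear growth mentioned above.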

The tuned ε-greedy algorithm and the UCB algorithm for models without delays have been proposed and analysed in [1]. Both the tuned ε-greedy and UCB algorithms have logarithmic bounds on the number of times a sub-optimal arm is chosen in the case of no delays [1].

Algorithm tuned ε-greedy

1. Initialization: Choose $t_0 \in T$ and $\varepsilon_0 \in (0, t_0)$. During the first $t_0$ slots keep sending interests to routers in round robin fashion or randomly to routers chosen according to the uniform distribution.

2. for each time slot $t \ge t_0$ do

3. For each router $k$, compute the average delay:

$$\bar X_{k,T_k(t)} := \frac{1}{T_k(t)} \sum_{\tau=0}^{t-1} A_k(\tau, t) X_k(\tau).$$

4. For each router $k$, set the index equal to the average delay $\bar X_{k,T_k(t)}$.

5. With probability $1 - \varepsilon_0/t$ send a new interest to the router with the smallest index, and with probability $\varepsilon_0/t$ send a new interest to a uniformly randomly chosen router.

6. end for
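The only change with respect to the constant-ε version is the decaying exploration probability. A minimal sketch of that schedule follows; the function names are hypothetical, the value 1 during the initial phase mirrors the convention $\varepsilon_t := 1$ for $t < t_0$ used in the appendix, and the clamp `min(1.0, ...)` is a safety assumption added here.

```python
import math

def exploration_probability(t, t0, eps0):
    """Probability of a uniformly random choice at slot t under the tuned schedule."""
    return 1.0 if t < t0 else min(1.0, eps0 / t)

def expected_random_choices(horizon, t0, eps0):
    """Expected number of exploratory choices in slots t0, ..., horizon - 1."""
    return sum(exploration_probability(t, t0, eps0) for t in range(t0, horizon))

# A constant epsilon = 0.1 explores about 0.1 * n times,
# while the tuned schedule explores about eps0 * ln(n / t0) times.
for n in (1_000, 10_000, 100_000):
    constant = 0.1 * (n - 10)
    tuned = expected_random_choices(n, t0=10, eps0=5.0)
    print(n, round(constant, 1), round(tuned, 1), round(5.0 * math.log(n / 10), 1))
```

The comparison shows why the tuned variant can avoid the linear growth of exploratory choices: the number of random choices grows only logarithmically in the horizon.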

Algorithm Upper Confidence Bound (UCB)

1. Initialization: Choose $t_0 \in T$ and $L > 0$. During the first $t_0$ slots keep sending interests to routers in round robin fashion or randomly to routers chosen according to the uniform distribution.

2. for each time slot $t \ge t_0$ do

3. For each router $k$, compute the average delay:

$$\bar X_{k,T_k(t)} := \frac{1}{T_k(t)} \sum_{\tau=0}^{t-1} A_k(\tau, t) X_k(\tau).$$

4. For each router $k$, set the index to a lower confidence bound on the average delay $\bar X_{k,T_k(t)}$, where $L$ is the so-called exploration parameter.

5. Send a new interest to the CCN router with the smallest index.

6. end for

Parameters          Router 1   Router 2   Router 3
propagation delay   2          2          2
p parameter         0.8        0.7        0.6
r parameter         10         10         10
mean delay          4.5        6.29       8.67
std                 1.77       2.47       3.33

Table 1: The values of parameters in the numerical example.

Figure 1: Negative binomial distributions in example.

In our case, since we minimize the cost, we should more appropriately call this algorithm the lower confidence bound algorithm. However, to make an explicit connection with [1] we shall continue to call it the UCB algorithm. In previous works the UCB algorithm has shown slightly better performance than the tuned ε-greedy algorithm.
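To make the "lower confidence bound" idea concrete, here is a minimal sketch of an index computation. The specific form used, average delay minus $\sqrt{L \ln t / T_k(t)}$, is an assumption modelled on the standard UCB1 bonus from the bandit literature and mirrored for costs; the function names and the rule for routers without replies are likewise hypothetical.

```python
import math

def lcb_index(mean_delay, n_answered, t, L=2.0):
    """Assumed index: average delay minus an exploration bonus (smaller is better)."""
    if n_answered == 0:
        return float("-inf")    # no reply yet: force the router to be tried (assumption)
    bonus = math.sqrt(L * math.log(max(t, 1)) / n_answered)
    return mean_delay - bonus

def choose_router(mean_delays, answered_counts, t, L=2.0):
    """Return the position of the router with the smallest index."""
    indices = [lcb_index(m, n, t, L) for m, n in zip(mean_delays, answered_counts)]
    return min(range(len(indices)), key=indices.__getitem__)

# Equal numbers of replies: the router with the smallest average delay wins.
print(choose_router([4.5, 6.29, 8.67], [50, 50, 50], t=200))
# A router with a single reply gets a large bonus and is explored again.
print(choose_router([4.5, 6.29, 8.67], [200, 1, 200], t=400))
```

A larger $L$ widens the bonus and therefore sends more interests to rarely observed routers.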

To get an idea of the performance of the above algorithms in the presence of delay, we provide a numerical example. As the delay distribution $F_k(x)$ in our numerical examples, we have taken the negative binomial distribution with a deterministic shift. There are several reasons for this choice. The negative binomial distribution is quite versatile. With two parameters, we can easily choose any mean and variance, which have simple explicit expressions. The distribution shape can take diverse forms, such as the shape of the geometric distribution and a shape close to that of the normal distribution. The negative binomial distribution represents the distribution of a sum of geometrically distributed random variables. Since the waiting time distribution in many queueing systems is exponential or close to exponential, the negative binomial distribution represents well the response time of queueing systems in cascade. We introduce the deterministic shift to model the propagation delay. In Table 1 we present the parameters of our numerical example and in Figure 1 we plot the negative binomial distributions with the chosen parameters.
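The mean delays and standard deviations listed in Table 1 can be recomputed from the $p$ and $r$ parameters. The sketch below assumes the parameterization in which the negative binomial variable counts failures before the $r$-th success with success probability $p$, so that the mean is $r(1-p)/p$ and the variance is $r(1-p)/p^2$; the function name is hypothetical.

```python
import math

def shifted_negbin_stats(r, p, shift):
    """Mean and standard deviation of shift + NegBin(r, p), counting failures."""
    mean = shift + r * (1.0 - p) / p
    std = math.sqrt(r * (1.0 - p)) / p
    return mean, std

for router, p in enumerate((0.8, 0.7, 0.6), start=1):
    mean, std = shifted_negbin_stats(r=10, p=p, shift=2)
    print(f"Router {router}: mean delay {mean:.2f}, std {std:.2f}")
```

Under this parameterization the computed values agree with the "mean delay" and "std" rows of Table 1 after rounding to two decimals.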

In Figure 2 we plot the fraction of interests sent to the optimal arm as a function of time for the three algorithms with the Round Robin strategy employed in the initial phase. This numerical example demonstrates that, despite the presence of delays, the three algorithms perform well. In particular, as in the case of no delay, the performances of the UCB and tuned ε-greedy algorithms are comparable, and the ε-greedy algorithm does not perform too badly. In the following sections we provide a detailed analysis of these three algorithms.
Figure 2: Comparison of MAB algorithms.

Figure 3: The effect of the initial phase duration and initial strategy: ε-greedy algorithm.

3 Analysis of initial exploratory phase 

Let us now investigate the effect of the duration of the initial, purely exploratory, phase on the algorithm performance. We shall consider two possible initial strategies: the Round Robin (RR) strategy and the strategy in which the arm is chosen randomly with uniform probability (Uni). Note that in the Round Robin strategy the initial arm and the order are chosen randomly with uniform distribution.

In Figures 3-5, for our numerical example, we plot the fraction of interests sent to the optimal arm for different durations ($t_0 = 3, 9, 30$) of the initial phase, for the different algorithms and initial phase strategies.

Somewhat surprisingly, it turns out that it is better to use a very short initial phase. Another important observation is that it is better to use the Round Robin initial strategy than the uniformly random strategy. This is intuitively expected, as the Round Robin strategy reduces the randomness. Below we provide a theoretical explanation of these phenomena.

The initial phase $[0, t_0 - 1]$ is characterized by a large exploration effort. Here we would like to provide an estimate for the period after which we can with high certainty rely on the choice of the best performing arm based on the evaluated averages. Specifically, let us estimate the probability of choosing the best arm (denoted by $*$) given that the arms are chosen independently before the end of the initialization phase.

Figure 4: The effect of the initial phase duration and initial strategy: tuned ε-greedy algorithm.

Denote by $I_t$ the arm chosen at time slot $t$. Assume first that arms are chosen randomly and independently during the initial phase with probability $p_j := E[\mathbf{1}\{I_t = j\}]$, $j = 1, \dots, K$. In the case of the uniformly random strategy we have $p_j = 1/K$. Let further $D$ be the maximum possible delay between choosing the arm and observing the realization ($D = 1$ corresponds to no delay, i.e., receiving the chunk always in the slot immediately after the slot when an interest was sent)
and 

, A.,- A., 

where $\Delta_j = \mu_j - \mu_*$. Then, we have the following result.

Theorem 1 If during the exploration phase we choose the arms randomly and independently with uniform distribution ($p_j = 1/K$), and at the end of the exploration period, at slot $t_0$, we choose the arm according to the estimated average, the probability of choosing the best arm is lower bounded by

A strong point of the above result is that the derived lower bound is given in terms of an exponential function, which means that starting from some value of $t_0$ the probability of success will be very high. However, the bound (1) can be loose. Therefore, next we suggest an approximation of the success probability based on the central limit theorem.

Figure 5: The effect of the initial phase duration and initial strategy: UCB algorithm.

Also, it turns out that if the maximal delay is not too large, we do not introduce a large error by considering only interests sent by the time $t_0 - D$. Then, by the time $t_0$ we observe replies to all sent interests.

Theorem 2 If during the exploration phase we choose the arms randomly and independently with uniform distribution ($p_j = 1/K$), and if at the end of the exploration period, at slot $t_0$, we choose the arm according to the estimated average, the probability of choosing the best arm can be approximated as follows:



where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal random variable.

In the case when the Round Robin strategy is used in the initial phase, we can provide an even sharper approximation.

Theorem 3 If during the exploration phase we choose the arms according to the Round Robin strategy with the first arm and the order chosen randomly with the uniform distribution, and if at the end of the exploration period, at slot $t_0$, we choose the arm according to the estimated average, the probability of choosing the best arm can be approximated as follows:







P[X*,T,{to-D) < minXj-T,(to-D)] 




(3) 
Figure 6: Approximations for the probability of choosing the optimal arm at the end of the initial phase.

We consider now our numerical example with truncated negative binomial distributions with $D = 15$. In Figure 6 we plot the approximations (2) and (3), which firstly confirm that a very short initial phase is enough and secondly confirm our intuition that the Round Robin strategy is better than the random strategy.

One may be interested in a rough estimate of the number of time slots after which the optimal arm will be selected with high probability using the estimated averages. We can provide a recommendation for such a value based on (3) and the 2-sigma rule. If the arguments of the standard normal distribution function are equal to two, then the respective probabilities are greater than 0.977. Thus, we conclude that after the time

$$T \ge D + 4K \, \frac{\mathrm{Var}(X_*) + \max_j \mathrm{Var}(X_j)}{\min_j \Delta_j^2},$$

using the estimated averages and the RR strategy, we select the optimal arm with probability at least $0.977^{K-1}$. In our numerical example, after 68 time slots the probability of correctly choosing the optimal arm is estimated to be more than 0.95. This is even a conservative estimate, and in reality an even shorter exploratory period is needed.
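The 68-slot figure can be roughly reproduced. The sketch below assumes that the threshold has the form $D + 4K\,(\mathrm{Var}(X_*) + \max_j \mathrm{Var}(X_j)) / \min_j \Delta_j^2$, which is what setting the argument of $\Phi$ to two gives when each arm receives a share $1/K$ of the interests under round robin. It also uses the variances of the untruncated shifted negative binomial distributions (counting failures, as an assumption) rather than the truncated ones, so the result is only approximate; the function name is hypothetical.

```python
def exploration_threshold(D, means, variances):
    """Assumed 2-sigma threshold: D + 4K (Var_best + max other Var) / (min gap)^2."""
    K = len(means)
    best = min(range(K), key=means.__getitem__)
    others = [j for j in range(K) if j != best]
    min_gap_sq = min((means[j] - means[best]) ** 2 for j in others)
    worst_var = max(variances[j] for j in others)
    return D + 4 * K * (variances[best] + worst_var) / min_gap_sq

r, shift = 10, 2
ps = (0.8, 0.7, 0.6)
means = [shift + r * (1 - p) / p for p in ps]
variances = [r * (1 - p) / p ** 2 for p in ps]
print(exploration_threshold(15, means, variances))
```

The printed value lies between 68 and 69 slots; truncating the distributions at $D = 15$ lowers the variances slightly, which is consistent with the 68 slots quoted above.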

4 Logarithmic bound for the tuned e-greedy algorithm 

In this section we finally prove that the regret (cumulative suboptimality) of employing the tuned ε-greedy algorithm is bounded logarithmically in $t$, which is the same result as for the case without delay (and known to be the best possible) [1].

Theorem 4 Let $a > 0$ and $0 < d \le \min_{k: \mu_k > \mu_*} \Delta_k$, and let the initial phase be run with the uniformly random strategy. For all $K > 1$ and for all delay distributions $F_1, \dots, F_K$ with support in $[1, D]$, if algorithm tuned ε-greedy is run with input parameters $t_0 > \varepsilon_0 := aK/d^2$, then the probability that the algorithm chooses in slot $t \ge t_0$ a suboptimal arm $j$ is at most



a 



2D— In 



td^e^/^\ f aK 



d^ \ aK J ytd'^e^/^, 
16D^ ( D + l]^ / aK \^ a 
This bound says that the cumulative probability of suboptimal decisions is logarithmic for $a$ large enough (surely if $a > \max\{14 d^2/3, 8D^2\}$), because the instantaneous suboptimality at any slot $t \ge t_0$ is of the order $(K - 1)a/(d^2 t) + o(1/t)$ for $t \to \infty$. We conclude that the smaller the number of arms (neighbour CCN routers) and the larger $d$, the difference between the mean delays of the best and the strictly second-best arm, the better is the performance of the tuned ε-greedy algorithm.
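To see why a per-slot probability of order $1/t$ yields a logarithmic total, one can sum the leading term over the slots $t_0, \dots, n$. The following is a sketch that ignores the $o(1/t)$ remainder and assumes $t_0 \ge 1$:

```latex
\sum_{t=t_0}^{n} \frac{(K-1)\,a}{d^2\, t}
  \;\le\; \frac{(K-1)\,a}{d^2}\left(\frac{1}{t_0} + \int_{t_0}^{n} \frac{ds}{s}\right)
  \;=\; \frac{(K-1)\,a}{d^2}\left(\frac{1}{t_0} + \ln\frac{n}{t_0}\right).
```

The expected number of interests sent to suboptimal routers up to slot $n$ therefore grows like $\ln n$, with a constant proportional to $(K-1)a/d^2$.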

5 Conclusion 

The contribution of this paper is twofold. First, we have proposed tractable and well-performing interest forwarding algorithms for CCN networks. We have demonstrated that the algorithms work fast and that only logarithmically few interests are sent suboptimally, which means that the resources of the user and of the CCN routers are efficiently managed. Theoretical bounds show that the learning process attains the best achievable (logarithmic) rate.

Second, we have also contributed to the theory of the multi-armed bandit problem with delayed information. This is an important and challenging topic with few existing results. We have provided a finite-time analysis of algorithms extended to this setting and showed that the deterioration of their performance due to delays is not significant. Perhaps surprisingly, there is no need to include a long exploratory phase: just a single datum from each arm is sufficient for an efficient performance of the algorithms.
A Appendix: Proofs
A.1 Auxiliary Material

Let us state concentration inequalities to be used in the proofs of the theorems. We first state the Chernoff-Hoeffding bound in a general form. This is called Hoeffding's inequality in [11, p. 191], citing [7].

Theorem 5 (Chernoff-Hoeffding bound) Let $Y_1, Y_2, \dots, Y_t$ be independent random variables with zero means and bounded ranges $a_t \le Y_t \le b_t$. Then, for each $\eta > 0$,

$$P[Y_1 + Y_2 + \dots + Y_t \le -\eta] \le \exp\left\{ -2\eta^2 \Big/ \sum_{s=1}^{t} (b_s - a_s)^2 \right\},$$

$$P[Y_1 + Y_2 + \dots + Y_t \ge \eta] \le \exp\left\{ -2\eta^2 \Big/ \sum_{s=1}^{t} (b_s - a_s)^2 \right\}.$$

Let us also state Bennett's inequality and its consequence, Bernstein's inequality.

Theorem 6 (Bennett's inequality) Let $Y_1, Y_2, \dots, Y_t$ be independent random variables with zero means and bounded ranges $-M \le Y_t \le M$. Write $\sigma_t^2$ for the variance of $Y_t$. Suppose $V \ge \sigma_1^2 + \dots + \sigma_t^2$. Then, for each $\eta > 0$,

$$P[Y_1 + Y_2 + \dots + Y_t \ge \eta] \le \exp\left\{ -\tfrac{1}{2} \eta^2 V^{-1} B(M\eta/V) \right\},$$

where $B(\lambda) := 2\lambda^{-2}[(1 + \lambda)\log(1 + \lambda) - \lambda]$, for $\lambda > 0$.

According to [11, p. 193]:

"The function $B(\cdot)$ is well-behaved: continuous, decreasing, and $B(0+) = 1$. When $\lambda$ is large, $B(\lambda) \approx 2\lambda^{-1} \log \lambda$ in the sense that the ratio tends to one as $\lambda \to \infty$; the Bennett Inequality does not give a true exponential bound for $\eta$ compared to $V/M$. For smaller $\eta$ it comes very close to the bound for normal tail probabilities. Problem 2 shows that $B(\lambda) \ge (1 + \tfrac{1}{3}\lambda)^{-1}$ for all $\lambda \ge 0$."

Using the last bound, we get Bernstein's inequality.

Theorem 7 (Bernstein's inequality) Let $Y_1, Y_2, \dots, Y_t$ be independent random variables with zero means and bounded ranges $-M \le Y_t \le M$. Write $\sigma_t^2$ for the variance of $Y_t$. Suppose $V \ge \sigma_1^2 + \dots + \sigma_t^2$. Then, for each $\eta > 0$,

$$P[Y_1 + Y_2 + \dots + Y_t \ge \eta] \le \exp\left\{ -\tfrac{1}{2} \eta^2 \Big/ \left( V + \tfrac{1}{3} M \eta \right) \right\},$$

$$P[Y_1 + Y_2 + \dots + Y_t \le -\eta] \le \exp\left\{ -\tfrac{1}{2} \eta^2 \Big/ \left( V + \tfrac{1}{3} M \eta \right) \right\}.$$
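The step from Bennett's to Bernstein's inequality rests on the bound $B(\lambda) \ge (1 + \lambda/3)^{-1}$. A small numerical sanity check (not a proof) is sketched below; the function names are hypothetical and the exponents are those of the one-sided forms $\exp\{-\frac{1}{2}\eta^2 V^{-1} B(M\eta/V)\}$ and $\exp\{-\frac{1}{2}\eta^2/(V + \frac{1}{3}M\eta)\}$.

```python
import math

def bennett_B(lam):
    """B(lambda) = 2 lambda^-2 [(1 + lambda) log(1 + lambda) - lambda], for lambda > 0."""
    return 2.0 * ((1.0 + lam) * math.log1p(lam) - lam) / lam ** 2

def bennett_exponent(eta, V, M):
    """Exponent of Bennett's bound: eta^2 / (2V) * B(M eta / V)."""
    return 0.5 * eta ** 2 / V * bennett_B(M * eta / V)

def bernstein_exponent(eta, V, M):
    """Exponent of Bernstein's bound: eta^2 / (2 (V + M eta / 3))."""
    return 0.5 * eta ** 2 / (V + M * eta / 3.0)

for lam in (0.01, 0.1, 1.0, 10.0, 100.0):
    print(lam, bennett_B(lam), 1.0 / (1.0 + lam / 3.0))
```

At every sampled $\lambda$ the first printed value is at least the second, so Bennett's exponent is never smaller than Bernstein's; Bernstein's form trades some tightness for a simpler expression.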



Finally, we present Azuma's inequality.

Theorem 8 (Azuma's inequality) Let $Z_t$ be a martingale with zero mean and bounded increments, i.e.,

$$|Z_t - Z_{t-1}| \le c(t),$$

almost surely. Then, for all positive integers $t$ and all positive reals $\lambda$, we have

$$P[Z_t \ge \lambda] \le \exp\left( -\frac{\lambda^2}{2 \sum_{s=1}^{t} c(s)^2} \right).$$



A.2 Proof of Theorem 1

We need to evaluate the following probability: 



= ]^P[X„T.(to) < ^3,T,(to)] 



A 



A 



A., 



Now let us estimate the probability $P[\bar X_{*,T_*(t_0)} < \mu_* + \Delta_j/2]$.

P[X*.T.(t„) < M* + ^] = 1 - P[X,,T,ito) > M 

Eti - *}X,{s)l{s + X4s) < to} 



1 - P 



Es=iHls = ^}H-^ + X4s)<ta} 



> Ai* + 



1-P 



to 



(5) 



J2 His = - + X,{s) <to}>^J2 His = + X,{s) < to} 



1 - P 



to 



His = *}{X4s) - + X4s) < to} 



A, 



to 



to 



^(1{/, = *}-p*)l{s + X,(s) < to} > ^P*^l{s + Ar,(s) < to} 



s=l 
1 - P



J2 His = *}{X4s) - + X4s) < to} 



,s=l 



A, 



A, 



J2iHIs = *}~P*)Hs + X4s)<to}- ^p.Y.^l{s + X4s)<to}-q*^to-s) 



where $q_{*,i} = P[X_*(t) < i]$.
Next we define



D 



i=l 



-^$](lR = *}-P*)l{s + ^*(s) <t} 



A 



It is a martingale (with respect to the sequence of the observed delays) with zero mean and
bounded increment

\Zt - Zt-\ \ < Cj, 

with cj = + + ^p^D. 

Thus, we can apply Azuma's inequality for martingales, which gives in our case 



A,. ^ / ^',/'ipl{to-D + Ef=i'l*.^) 
—r-\>l- cxp I 



2cpo 



> 1 — exp — 



Similarly, we have 



P{X^,T,it„)>l^j-^]>l-oxp 



2c]t 



Substituting ^ and (O into (O, we complete the proof. 

A.3 Proof of Theorem 2

Similarly to ([5]), we have 

P[X*,T.[to-D) < minXj^Tj(to-D)] 



(6) 



(7) 



(8) 
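Define

$$Y_t := \sum_{s=1}^{t} \left[ \mathbf{1}\{I_s = *\} (X_*(s) - \mu_*) - \frac{\Delta_j}{2} \left( \mathbf{1}\{I_s = *\} - p_* \right) \right].$$

Then, we can use the Central Limit Theorem to estimate the probability

$$P\left[\bar X_{*,T_*(t_0 - D)} < \mu_* + \frac{\Delta_j}{2}\right] = P\left[Y_{t_0 - D} < \frac{\Delta_j}{2} p_* (t_0 - D)\right]$$

$$= P\left[ \frac{Y_{t_0 - D}}{\sqrt{(t_0 - D)(p_* \mathrm{Var}(X_*) + \Delta_j^2 p_* (1 - p_*)/4)}} < \frac{\Delta_j p_* (t_0 - D)}{2 \sqrt{(t_0 - D)(p_* \mathrm{Var}(X_*) + \Delta_j^2 p_* (1 - p_*)/4)}} \right],$$

which gives

$$P\left[\bar X_{*,T_*(t_0 - D)} < \mu_* + \frac{\Delta_j}{2}\right] \approx \Phi\left( \frac{\Delta_j p_* \sqrt{t_0 - D}}{2 \sqrt{p_* \mathrm{Var}(X_*) + \Delta_j^2 p_* (1 - p_*)/4}} \right), \qquad (9)$$

where $\Phi(\cdot)$ is the standard normal distribution function. Similarly, we obtain

$$P\left[\bar X_{j,T_j(t_0 - D)} > \mu_j - \frac{\Delta_j}{2}\right] \approx \Phi\left( \frac{\Delta_j p_j \sqrt{t_0 - D}}{2 \sqrt{p_j \mathrm{Var}(X_j) + \Delta_j^2 p_j (1 - p_j)/4}} \right). \qquad (10)$$

The substitution of (9) and (10) into (8) yields the result.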

The proof of Theorem 3 is simpler than the proof of Theorem 2 and it is omitted. 
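The normal-approximation factors appearing in this proof are easy to evaluate numerically. The sketch below assumes that each factor has the form $\Phi\big(\Delta\, p\, \sqrt{t_0 - D} \,/\, (2\sqrt{p\,\mathrm{Var}(X) + \Delta^2 p (1 - p)/4})\big)$; the function names and the sample values (gap 1.79, variance 3.125, $p = 1/3$, $D = 15$, loosely based on the numerical example without truncation) are illustrative assumptions.

```python
import math

def std_normal_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def clt_factor(gap, p, var, t0, D):
    """Assumed factor Phi(gap * p * sqrt(t0 - D) / (2 sqrt(p var + gap^2 p (1 - p) / 4)))."""
    scale = math.sqrt(p * var + gap ** 2 * p * (1.0 - p) / 4.0)
    return std_normal_cdf(gap * p * math.sqrt(t0 - D) / (2.0 * scale))

# The factor starts at 1/2 when t0 = D and increases towards 1 as t0 grows.
for t0 in (15, 30, 60, 120):
    print(t0, clt_factor(gap=1.79, p=1 / 3, var=3.125, t0=t0, D=15))
```

The argument of $\Phi$ grows like $\sqrt{t_0 - D}$, which is why a modest number of slots beyond $D$ already gives a high probability of ranking two arms correctly.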

A.4 Proof of Theorem 4

Note that the assumption $t \ge t_0$ means that we are in the exploitation phase, and let us denote $\varepsilon_t := \varepsilon_0/t$ for all $t \ge t_0$, while $\varepsilon_t := 1$ for all $t < t_0$.
Let $\bar X_{j,s}$ be the sample mean of observed delays (costs) if arm $j$ was chosen $s$ times, conditioned on the delay distribution. Let $\bar X_{j,s,u}$ be the sample mean of observed delays if arm $j$ was chosen $s$ times having obtained $u \le s$ observations. Let $S_j(t)$ denote the number of times arm $j$ was chosen in the first $t$ slots $[0, t - 1]$. Recall that $I_t$ denotes the arm chosen at slot $t$. Then we have



'[It=j]<{l-et)] 



Xj,s,(t) < maxXfe^Sfc(t) 



+ 



El 
K' 



Note that here we have an inequality in order to account for an arbitrary rule of breaking ties in deciding the arm to choose in case several arms have the same lowest sample mean.
If $j \ne *$ (where $*$ denotes any of the best arms), then we can bound it by



K 



< 



— ^3 ' 

Xj,S,it) <t^3 - — 



— ^/ 



K' 



(11) 
Let now $U_{j,s}(t)$ denote the number of observed realizations by the beginning of slot $t$ from arm $j$ given that it was chosen $s$ times in the slots $[0, t - 1]$. In order to upper-bound the first two terms in (11) (by an expression independent of $j$), let us study the following expression next.



A. 



s=l 



Sj{t) = s and Xj^s > fJ-j H — ^ 



= E 

t 

= E 

8 = 1 
t 

= E 



S,{t) = s I X,-, > + ^ 
S,{t) = s I X,-, > + ^ 



— A, 



E^ 

s 

E^ 



Uj^s{t) = u and > fJ.j + ^ 



X j.s.u ^ + 

(12) 



Assuming that 



Xj,s,u > A^j + 2^ 



> 0, then, for 1 < u < s, 



= 0, if s - D + 1 > w, 
< 1, if s - D + 1 < M, 



because there can be at most $D - 1$ unobserved realizations of the chosen arms ($s - u \le D - 1$).
Hence,



E' 

11=1 



Uj.sit) = U I Xj^,,u >fij + ^ 



X'j,s,u > + — 



^ E ^ 

u— max{l,s — D + l} 



< 



E 



exp ■ 



u— max{l,s — D + l} 



E '^^P' 

u — m ax { 1 , s — Z) 4- 1 } 



where the last inequality is due to the ChernofF-Hoeffding bound (employed with 1] — , ht = 

D, at = — T ^ u). 

Upperbounding the last geometric sum by a sum of constants equal to the first term, we 
further have 



El 



Xj,s,u > + 



I A? 

< £iexp<{ -^max{l,s-i:> + l} 
This bound plugged into (12) therefore gives us

— ^3 

Xj.S,{t) >tJ-j + ^ 



< 



dJ2f 

oo 

< -D^P 

s=l 

< D exp I 

s=D 
oo 



cxp ■ 



exp ■ 



8L>2 



c{l,s-D + l} 



8D^ I ^ 



^max{l,.s-D + l} 
A ' 



S,{t)^s\X,.,>f,, + ^ 



exp ■ 



8D^ 



s=L-EJ + l 



exp ■ 



{s-D + 1) 



A? 



8D 



(13) 



where 



1 



Note that if $\lfloor E \rfloor \ge D - 1$, then the above decomposition of the sum in the last step in fact holds as equality. In case $\lfloor E \rfloor < D - 1$, the second term is zero and some of the summands appear both in the first and in the third term, therefore the inequality holds.

The sum of the first and second terms in (13) can be upper-bounded by



IE] 



omitting the exponential terms ($\le 1$), which is further upper-bounded (as in [1]) by

A, 



^E' 



5f (t) < s I X,- , > + ^ 



< DE¥[Sf{t) < E] , 



where $S_j^R(t) \le S_j(t)$ is the number of times arm $j$ was chosen in the first $t$ slots $[0, t - 1]$ at random. Using the Bernstein inequality (with $Y_{s+1}$ for $s = 0, 1, \dots, t - 1$ being the random variable of sending the interest to router $j$ at slot $s$, with expected value $\varepsilon_s/K$, bounded by $M = 1$, and variance $\sigma_{s+1}^2 = (1 - \varepsilon_s/K)(0 - \varepsilon_s/K)^2 + \varepsilon_s/K (1 - \varepsilon_s/K)^2 = (1 - \varepsilon_s/K)\varepsilon_s/K \le \varepsilon_s/K$, so that $V = 2E$, and taking $\eta = E$), we have (a slightly tighter upper bound than in [1])

$$P[S_j^R(t) \le E] \le \exp\left\{ -\frac{3E}{14} \right\},$$

and for $t \ge aK/d^2$, we lower-bound $E$ as in [1] (denoted $x_0$ there),

$$E \ge \frac{a}{d^2} \ln \frac{t d^2 e^{1/2}}{aK}. \qquad (14)$$
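The constant in the exponent of this Bernstein step can be checked by direct substitution. This is a short worked computation, assuming the one-sided Bernstein bound in the form $\exp\{-\frac{1}{2}\eta^2/(V + \frac{1}{3}M\eta)\}$ with the choices $\eta = E$, $V = 2E$ and $M = 1$ stated above:

```latex
\left.\frac{\tfrac{1}{2}\,\eta^{2}}{V + \tfrac{1}{3} M \eta}\right|_{\eta = E,\; V = 2E,\; M = 1}
  = \frac{\tfrac{1}{2} E^{2}}{2E + \tfrac{1}{3} E}
  = \frac{E^{2}/2}{7E/3}
  = \frac{3E}{14},
\qquad
\exp\left\{-\frac{3E}{14}\right\}
  \le \left(\frac{aK}{t\, d^{2} e^{1/2}}\right)^{3a/(14 d^{2})}.
```

The second inequality uses the lower bound (14) on $E$. The resulting power of $1/t$ exceeds one exactly when $a > 14d^2/3$, which matches the condition on $a$ quoted after Theorem 4.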
Therefore, the sum of the first and second terms in (13) can be upper-bounded by



aK 



aK 



As in [1], the third term in (13) can be upper-bounded by



^exp. 



3. 

8L>2 



m-D) 



exp ■ 



8L>2 



D > exp ■ 



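omitting the probability term ($\le 1$) and using $\sum_{s=r+1}^{\infty} e^{-\kappa s} \le \frac{1}{\kappa} e^{-\kappa r}$ for $\kappa > 0$, with $r = \lfloor E \rfloor - D$.

Further, using $\lfloor E \rfloor \ge E - 1$, this can be upper-bounded by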

'a|cd + i)" 



8D2 



^exp. 



8i:>2 



exp ■ 



— 



and further by 



8^3 



■ exp 



D'^{D + 1)\ ( aK 



8L'2 



where the bound for the third term is obtained using (14).
So, we have



<P 



■ exp 



2 

£> + 1 



< D 



aK 



hi- 



td2ei/2 



In fact, the same upperbound holds for 



which is the second term in 



Finally, we have $\varepsilon_t = aK/(d^2 t)$ to plug in the third term in (11), therefore



'[It =j]< 2D- [In 



t<Pe^'^\ f aK 



aK 



■ exp 



D + \\ f aK 



a 